Skip to content

feat(agent): persistent caching for agent#59

Merged
varonix0 merged 11 commits into
mainfrom
daniel/agent-manage-leases
Nov 18, 2025
Merged

feat(agent): persistent caching for agent#59
varonix0 merged 11 commits into
mainfrom
daniel/agent-manage-leases

Conversation

@varonix0
Copy link
Copy Markdown
Member

Description 📣

This PR adds persistent caching to the agent (specifically for kubernetes). The agent will now be able to persist dynamic secret leases in between restarts from Kubernetes. Access tokens are not stored in the cache, instead they continue to exist in sinks like they always have. Sinks are files.

This is a prerequisite for adding hand-over support to the Agent Injector, so the agent injector will keep managing the same dynamic secret leases when the user has configured the agent injector to run both init and sidecar mode at the same time.

Minor:

  • Added support for setting LOG_LEVEL environment variable instead of having to pass --log-level.
  • Added yellow color to WARN logs. Previously warnings had the same color as errors (red).
  • Added IsDevelopment() funtion so we don't have to check the CLI_VERSION variable directly to determine if we're in development mode or not

Type ✨

  • Bug fix
  • New feature
  • Improvement
  • Breaking change
  • Documentation

Tests 🛠️

# Here's some code block to paste some code snippets

@varonix0 varonix0 self-assigned this Nov 16, 2025
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Nov 16, 2025

Greptile Overview

Greptile Summary

This PR adds persistent caching for agent dynamic secret leases using Badger DB with AES-GCM encryption, enabling the agent to recover leases across Kubernetes pod restarts. Minor improvements include LOG_LEVEL environment variable support and yellow coloring for warning logs.

Key Changes:

  • Persistent cache implementation using Badger v3 database with AES-GCM encryption
  • Cache encryption key derived from SHA256 hash of Kubernetes service account token
  • Dynamic secret leases stored in cache with TTL matching lease expiration
  • Cached leases verified against API on agent startup; expired/invalid leases revoked
  • Unused leases cleaned up after all templates render once
  • Added LOG_LEVEL environment variable support as fallback for --log-level flag
  • Updated warning log color from red to yellow

Critical Security & Logic Issues Found:

  • Encryption key weakness: Encryption key derived from the same service account token that grants filesystem access to the cache, creating circular security dependency
  • Race condition: Lease verification in GetCachedLease occurs outside mutex protection, allowing concurrent operations to access stale/invalid leases
  • Hardcoded expiry buffer: 2-minute buffer causes valid leases to be prematurely revoked during frequent restarts
  • Insufficient validation: Cache decryption lacks bounds checking on ciphertext size, enabling potential DoS attacks
  • Missing filesystem security: Default Kubernetes token path not validated for existence, permissions, or symlink attacks

Recommendations:

  • Derive cache encryption key from separate source (Kubernetes secret, env var) or use envelope encryption
  • Extend mutex protection to cover entire lease verification sequence
  • Make expiry buffer configurable or use the defined constant consistently
  • Add ciphertext size limits and filesystem validation for token paths

Confidence Score: 2/5

  • This PR introduces critical security vulnerabilities and race conditions that could compromise dynamic secret leases and cause operational issues.
  • Score of 2 reflects multiple critical security and logic issues: weak encryption key derivation creates circular security dependency, race conditions in lease verification could serve stale credentials, hardcoded expiry buffers cause premature revocation, and missing input validation enables potential attacks. While the core caching functionality is implemented correctly, these issues pose significant security and reliability risks in production Kubernetes environments.
  • Pay close attention to packages/cmd/agent.go (encryption key derivation at line 248, race condition at lines 512-540, hardcoded buffer at line 543) and packages/util/cache/cache-storage.go (ciphertext validation at line 248)

Important Files Changed

File Analysis

Filename Score Overview
packages/cmd/agent.go 2/5 Added persistent caching for dynamic secret leases with encryption. Critical security issues: encryption key derived from same token protecting cache access, race conditions in lease verification, hardcoded expiry buffers.
packages/util/cache/cache-storage.go 3/5 New encrypted storage implementation using Badger DB with AES-GCM encryption. Issues: insufficient ciphertext validation, no bounds checking on ciphertext size.
packages/api/api.go 4/5 Added CallGetDynamicSecretLeaseV1 function to retrieve lease details and ErrNotFound error for 404 responses. Implementation looks correct.
main.go 4/5 Updated log formatting to change WARN level color from red to yellow. Added custom color mapping for all log levels.
packages/cmd/root.go 5/5 Added support for LOG_LEVEL environment variable as fallback when --log-level flag not provided. Clean implementation.

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

14 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

Comment thread packages/cmd/agent.go
Comment thread packages/cmd/agent.go Outdated
Comment thread packages/cmd/agent.go
Comment thread packages/util/cache/cache-storage.go
Comment thread packages/cmd/agent.go
Comment thread packages/cmd/agent.go
Comment thread packages/cmd/agent.go
Comment thread packages/cmd/agent.go
Comment thread packages/cmd/agent.go
Comment on lines +529 to +549
// lease not found in API, delete it from cache and return nil
if errors.Is(err, api.ErrNotFound) {
log.Warn().Msgf("dynamic secret lease does not exist, deleting from cache: [lease-id=%s]", leaseFromCache.LeaseID)
if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
}

return nil
}

// lease is found in cache but not in the the API, and the API returned a non 404-error. We should attempt to revoke it
// at this point we know that we should be able to reach the API because we've done authentication successfully
log.Warn().Msgf("unable to get dynamic secret lease from API. Revoking lease from cache: [lease-id=%s]", leaseFromCache.LeaseID)
if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
}

if err := revokeDynamicSecretLease(accessToken, leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.LeaseID); err != nil {
log.Warn().Msgf("unable to revoke dynamic secret lease %s: %v", leaseFromCache.LeaseID, err)
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are deleting it anyway, so the if should be to revoke or not:

Suggested change
// lease not found in API, delete it from cache and return nil
if errors.Is(err, api.ErrNotFound) {
log.Warn().Msgf("dynamic secret lease does not exist, deleting from cache: [lease-id=%s]", leaseFromCache.LeaseID)
if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
}
return nil
}
// lease is found in cache but not in the the API, and the API returned a non 404-error. We should attempt to revoke it
// at this point we know that we should be able to reach the API because we've done authentication successfully
log.Warn().Msgf("unable to get dynamic secret lease from API. Revoking lease from cache: [lease-id=%s]", leaseFromCache.LeaseID)
if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
}
if err := revokeDynamicSecretLease(accessToken, leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.LeaseID); err != nil {
log.Warn().Msgf("unable to revoke dynamic secret lease %s: %v", leaseFromCache.LeaseID, err)
return nil
}
if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
}
// only attempt to revoke if the lease exists (not a 404 error)
// if it's 404, the lease doesn't exist so there's nothing to revoke
if !errors.Is(err, api.ErrNotFound) {
log.Warn().Msgf("unable to get dynamic secret lease from API. Attempting to revoke lease: [lease-id=%s]", leaseFromCache.LeaseID)
if err := revokeDynamicSecretLease(accessToken, leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.LeaseID); err != nil {
log.Warn().Msgf("unable to revoke dynamic secret lease %s: %v", leaseFromCache.LeaseID, err)
}
} else {
log.Warn().Msgf("dynamic secret lease does not exist in API: [lease-id=%s]", leaseFromCache.LeaseID)
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I'm following, what we want is to is:

If 404:
Delete from cache, don't even try to revoke. Just return right away

If not 404:
Delete from cache, and try to revoke

I feel like both are trying to achieve the same, but I feel like what we have currently reads a bit easier?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indentation on GitHub was horrible 😑 Let me paste a screenshot.

We would move from these 2 repeated blocks:

image

To one at the beginning, because the deletion always happens:

image

The idea is just to avoid a bit of code duplication.

Comment thread packages/cmd/agent.go
Comment on lines +530 to +548
if errors.Is(err, api.ErrNotFound) {
log.Warn().Msgf("dynamic secret lease does not exist, deleting from cache: [lease-id=%s]", leaseFromCache.LeaseID)
if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
}

return nil
}

// lease is found in cache but not in the the API, and the API returned a non 404-error. We should attempt to revoke it
// at this point we know that we should be able to reach the API because we've done authentication successfully
log.Warn().Msgf("unable to get dynamic secret lease from API. Revoking lease from cache: [lease-id=%s]", leaseFromCache.LeaseID)
if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
}

if err := revokeDynamicSecretLease(accessToken, leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.LeaseID); err != nil {
log.Warn().Msgf("unable to revoke dynamic secret lease %s: %v", leaseFromCache.LeaseID, err)
return nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: We could do all this release and revoke process in a go routine to unblock the function from async operations.

It doesn't help anything in the rate limiting situation, but in all other cases we could take advantage of it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm how is that? When we return nil from GetLease, it is handled as the lease doesn't exist and then a new lease will be provisioned. If we moved it to a goroutine we may face race conditions and cases where GetLease gets a lease even though it doesn't exist? It could be counteracted with mutation locks and waits but it wouldn't speed up the execution and would make things more complicated to follow

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do the GetDynamicSecretLease call outside the go routine. The routine only runs to delete and revoke.

We always return nil when the GetDynamicSecretLease returns an error or if the lease expires.

image

So we can call the go routine and return nil right after.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

image

Copy link
Copy Markdown
Contributor

@victorvhs017 victorvhs017 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved because this is time sensitive, if we have time to do the suggestions nice, if not, we can revisit later!

Comment thread packages/cmd/agent.go
}

// we call appendUnsafe because we already hold the lock, and if we call Append directly we'll get a deadlock
d.appendUnsafe(*leaseFromCache)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: appendUnsafe always saves the cache to the file:

image

But in this case, we just got the cache from the file:

image

And we don't update this value, which means that we are just saving it again. Saving into cache is IO operation, we could avoid this extra IO with a boolean flag in the appendUnsafe function to save in the persistent cache or not.

@varonix0 varonix0 merged commit cfce105 into main Nov 18, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants